
VIVAR: learning view-invariant embedding for video action recognition

https://doi.org/10.1117/12.3059138

Hasan, Zahid; Ahmed, Masud; Faridee, Abu Zaher Md; Purushotham, Sanjay; Lee, Hyungtae; Kwon, Heesung; Roy, Nirmalya (March 2025, SPIE)

Liang, Xuefeng (Ed.)

Deep learning has achieved state-of-the-art video action recognition (VAR) performance by learning action-related features from raw video. However, these models often jointly encode auxiliary view information (viewpoints and sensor properties) with the primary action features, which degrades performance under novel views and raises security concerns by revealing sensor types and locations. Here, we systematically study these shortcomings of VAR models and develop a novel approach, VIVAR, that learns view-invariant spatiotemporal action features by removing view information. In particular, we leverage contrastive learning to separate actions and jointly optimize an adversarial loss that aligns view distributions, removing auxiliary view information from the deep embedding space. Training uses unlabeled synchronous multiview (MV) video to learn a view-invariant VAR system. We evaluate VIVAR on our in-house large-scale time-synchronous MV video dataset containing 10 actions with three angular viewpoints and sensors in diverse environments. VIVAR successfully captures view-invariant action features, improves the quality of inter- and intra-action clusters, and consistently outperforms SoTA models with 8% higher accuracy. We additionally perform extensive studies with our datasets, model architectures, multiple contrastive learning methods, and view distribution alignments to provide insights into VIVAR. We open-source our code and dataset to facilitate further research in view-invariant systems.
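As a rough illustration of how a contrastive term and an adversarial view-alignment term can be combined, the following NumPy sketch computes both losses for a batch of embeddings from two synchronized views. It is not the VIVAR implementation: the InfoNCE form of the contrastive loss, the linear view discriminator (`W`, `b`), the `temperature` and `adv_weight` values, and all function names are assumptions chosen for illustration, and the paper's actual losses and architectures may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_view_info_nce(z_a, z_b, temperature=0.1):
    # Assumption: row i of z_a and row i of z_b embed the same time-synchronized
    # clip seen from two views (positive pair); other rows act as negatives.
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    idx = np.arange(len(z_a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def view_discriminator_loss(z, view_ids, W, b):
    # Cross-entropy of a hypothetical linear classifier predicting the view label.
    log_probs = log_softmax(z @ W + b)
    return -log_probs[np.arange(len(z)), view_ids].mean()

def encoder_objective(z_a, z_b, view_a, view_b, W, b, adv_weight=0.5):
    # The encoder minimizes the contrastive loss while maximizing the
    # discriminator's loss, so view labels become hard to predict from z.
    # The discriminator itself would be trained separately to minimize `disc`.
    contrastive = cross_view_info_nce(z_a, z_b)
    z = np.concatenate([z_a, z_b])
    views = np.concatenate([view_a, view_b])
    disc = view_discriminator_loss(z, views, W, b)
    return contrastive - adv_weight * disc, contrastive, disc

# Toy usage: 4 clips, 8-dim embeddings, views 0 and 1 out of 3 possible views.
rng = np.random.default_rng(0)
z_a = rng.normal(size=(4, 8))
z_b = z_a + 0.05 * rng.normal(size=(4, 8))
view_a = np.zeros(4, dtype=int)
view_b = np.ones(4, dtype=int)
W, b = np.zeros((8, 3)), np.zeros(3)
total, contrastive, disc = encoder_objective(z_a, z_b, view_a, view_b, W, b)
```

In a real training loop the two players would be updated alternately or through a gradient-reversal layer; this sketch only evaluates the loss values and leaves optimization out.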

Free, publicly-accessible full text available March 10, 2026
